In this section, we take an unsupervised machine learning approach and look for natural clusters (segments) within our world happiness data. We have already seen some initial correlations between economic indicators and happiness, so we expect to find some stable clusters.
We use the k-means clustering algorithm and run a bootstrap validation to check cluster stability.
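To make the idea of bootstrap validation concrete, here is a minimal base-R sketch. It is not the code used for this analysis: `toy` is synthetic stand-in data, and `jaccard()` and `bootstrap_stability()` are hypothetical helper names introduced only for illustration. The sketch clusters the full data once, re-clusters resampled data many times, and records how well each reference cluster is recovered (Jaccard similarity to its best-matching bootstrap cluster). The `clusterboot()` function in the fpc package offers a fuller implementation of the same idea.

```r
# Bootstrap stability sketch for k-means (base R only).
# `toy` is synthetic stand-in data, NOT the happiness dataset.
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Jaccard similarity between two sets of row indices
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

bootstrap_stability <- function(x, k, B = 50, nstart = 10) {
  n <- nrow(x)
  # Reference clustering on the full data
  ref <- kmeans(x, centers = k, nstart = nstart)$cluster
  scores <- matrix(NA_real_, nrow = B, ncol = k)
  for (b in seq_len(B)) {
    # Resample rows with replacement, then keep the distinct rows
    idx <- unique(sample.int(n, n, replace = TRUE))
    boot <- kmeans(x[idx, , drop = FALSE], centers = k, nstart = nstart)$cluster
    for (j in seq_len(k)) {
      # Members of reference cluster j that appear in this resample
      ref_members <- intersect(which(ref == j), idx)
      # Similarity to the best-matching bootstrap cluster
      sims <- sapply(seq_len(k), function(m) jaccard(ref_members, idx[boot == m]))
      scores[b, j] <- max(sims)
    }
  }
  colMeans(scores)  # mean Jaccard similarity per reference cluster
}

stability <- bootstrap_stability(toy, k = 2)
print(round(stability, 2))
```

Mean similarities close to 1 suggest a cluster that reappears under resampling, while values around 0.5 or lower suggest a cluster that dissolves when the data are perturbed.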
## $hopkins_stat
## [1] 0.3154874
##
## $plot
With a relatively low Hopkins statistic (about 0.32), we conclude that this dataset is not inherently clusterable. We can still produce clusters, but the boundaries between them will likely be soft. This is expected given the variety of indices we are measuring and the inherent heterogeneity of the world's countries. We expect better clustering results from the output of PCA / factor analysis than from our relatively raw dataset.
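As a rough illustration of what the Hopkins statistic measures, the sketch below computes one common form of it in base R. It is not the function that produced the value above: `hopkins_sketch()` is a hypothetical helper and both datasets are synthetic. It compares nearest-neighbour distances from uniformly placed random points to the data (`u`) with nearest-neighbour distances between real data points (`w`). Under the convention assumed here, values near 0.5 indicate roughly uniform data and values near 1 indicate a clustering tendency. Conventions differ between implementations and package versions; some report the complement, so it is worth checking the documentation of the tool in use before interpreting a value.

```r
# Hopkins statistic sketch (base R only).
# Convention assumed here: H near 1 = clustered, H near 0.5 = roughly uniform.
hopkins_sketch <- function(x, m = ceiling(nrow(x) / 10)) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)

  # Distance from point p to its nearest neighbour among the rows of `pts`
  nn_dist <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))

  # w: nearest-neighbour distances for m sampled real points (self excluded)
  s <- sample.int(n, m)
  w <- sapply(s, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))

  # u: nearest-neighbour distances for m uniform points in the bounding box
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  u <- replicate(m, nn_dist(runif(d, lo, hi), x))

  sum(u) / (sum(u) + sum(w))
}

# Synthetic examples: two tight blobs versus uniform noise
set.seed(42)
clustered <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(200, mean = 6, sd = 0.3), ncol = 2))
uniform <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform <- hopkins_sketch(uniform, m = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

The statistic depends on the random sample, so in practice it is worth repeating the computation with several seeds before drawing conclusions.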
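library("factoextra")  # provides fviz_cluster()

set.seed(24286)  # fix the random starts so the result is reproducible
# k-means with 7 centers; keep the best of 25 random initializations
km.res <- kmeans(cluster_df, centers = 7, nstart = 25)

# Visualize the clusters with convex hulls
fviz_cluster(km.res, data = cluster_df,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())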
## Warning in if (color == "cluster") color <- "default": the condition has
## length > 1 and only the first element will be used
## Warning in fanny(cluster_df, 2): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?
##   cluster size ave.sil.width
## 1       1   70          0.30
## 2       2   62          0.29
##
## Clustering Methods:
##  hierarchical kmeans pam
##
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10
##
## Validation Measures:
##                                  2        3        4        5        6        7        8        9       10
##
## hierarchical Connectivity  11.5111  17.6421  24.7548  29.1127  36.5270  39.9643  46.9325  47.9325  56.2524
##              Dunn           0.2787   0.2848   0.2911   0.2911   0.3372   0.3372   0.3966   0.3966   0.4276
##              Silhouette     0.3203   0.2573   0.2538   0.2286   0.2465   0.2374   0.2386   0.2251   0.2021
## kmeans       Connectivity  23.4702  43.3377  46.2397  41.0754  58.3714  60.6087  69.1726  81.1623  78.3413
##              Dunn           0.2194   0.1660   0.1997   0.3156   0.2698   0.2730   0.2619   0.3054   0.3624
##              Silhouette     0.3203   0.2445   0.2565   0.2514   0.2471   0.2405   0.2120   0.1930   0.1859
## pam          Connectivity  34.2468  45.0520  42.3103  47.1429  78.7536  88.0032 101.4996  98.8163 113.6448
##              Dunn           0.2194   0.1980   0.2649   0.3089   0.3068   0.3068   0.3068   0.3138   0.3138
##              Silhouette     0.3141   0.2298   0.2322   0.2418   0.1996   0.1915   0.1700   0.1730   0.1472
##
## Optimal Scores:
##
##              Score   Method       Clusters
## Connectivity 11.5111 hierarchical 2
## Dunn          0.4276 hierarchical 10
## Silhouette    0.3203 hierarchical 2
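The Silhouette rows above report an average silhouette width for each candidate number of clusters. As a hedged illustration of how such a row can arise for k-means, the base-R sketch below computes the average silhouette width for several values of k on synthetic data. `avg_silhouette()` and `blob()` are hypothetical helpers written for this example, and the validation package's own implementation may differ in detail.

```r
# Average silhouette width across k (base R only; synthetic data).
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))
  ks <- sort(unique(cl))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)  # singleton clusters score 0 by convention
    # a: mean distance to the other members of the point's own cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(sapply(setdiff(ks, cl[i]), function(k) mean(d[i, cl == k])))
    (b - a) / max(a, b)
  })
  mean(s)
}

# Three well-separated synthetic blobs (NOT the happiness data)
set.seed(7)
blob <- function(n, cx, cy, sd = 0.5) cbind(rnorm(n, cx, sd), rnorm(n, cy, sd))
toy <- rbind(blob(40, 0, 0), blob(40, 5, 5), blob(40, 0, 10))

ks <- 2:6
sil <- sapply(ks, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  avg_silhouette(toy, cl)
})
names(sil) <- ks

best_k <- ks[which.max(sil)]
print(round(sil, 3))
print(best_k)
```

Silhouette widths range from -1 to 1. Values around 0.3, as in the table above, point to weak structure, which is consistent with the soft cluster boundaries discussed earlier.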